On Minimax Optimal Offline Policy Evaluation
Authors
Abstract
This paper studies the off-policy evaluation problem, where one aims to estimate the value of a target policy based on a sample of observations collected by another policy. We first consider the multi-armed bandit case, establish a minimax risk lower bound, and analyze the risk of two standard estimators. It is shown, and verified in simulation, that one is minimax optimal up to a constant, while the other can be arbitrarily worse, despite its empirical success and popularity. The results are applied to related problems in contextual bandits and fixed-horizon Markov decision processes, and are also related to semi-supervised learning.
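As a concrete illustration of the setup described in the abstract (a minimal sketch, not code from the paper), the Python snippet below compares two standard off-policy estimators in a multi-armed bandit: the importance-sampling estimator, which reweights each observed reward by the ratio of target to behavior action probabilities, and the regression (plug-in) estimator, which first estimates each arm's mean reward and then averages those estimates under the target policy. All policies, reward means, and variable names are hypothetical.

```python
# Hypothetical sketch: off-policy evaluation in a multi-armed bandit.
# All policies and reward parameters below are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)

K = 3                                    # number of arms
mu = np.array([0.2, 0.5, 0.8])           # true mean rewards (unknown in practice)
pi_b = np.array([0.7, 0.2, 0.1])         # behavior policy (collects the data)
pi_t = np.array([0.1, 0.2, 0.7])         # target policy to evaluate
true_value = pi_t @ mu                   # v(pi_t), the quantity to estimate

n = 10_000
arms = rng.choice(K, size=n, p=pi_b)                 # actions drawn from pi_b
rewards = rng.binomial(1, mu[arms]).astype(float)    # Bernoulli rewards

# Importance-sampling (likelihood-ratio) estimator: reweight each observed
# reward by pi_t(a) / pi_b(a), then average.
is_estimate = np.mean(pi_t[arms] / pi_b[arms] * rewards)

# Regression (plug-in) estimator: estimate each arm's mean reward from the
# data, then average the estimates under the target policy.
mu_hat = np.array([rewards[arms == a].mean() if np.any(arms == a) else 0.0
                   for a in range(K)])
reg_estimate = pi_t @ mu_hat

print(f"true value         : {true_value:.4f}")
print(f"importance sampling: {is_estimate:.4f}")
print(f"regression         : {reg_estimate:.4f}")
```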
Similar resources
Toward Minimax Off-policy Value Estimation
This paper studies the off-policy evaluation problem, where one aims to estimate the value of a target policy based on a sample of observations collected by another policy. We first consider the single-state, or multi-armed bandit, case, establish a finite-time minimax risk lower bound, and analyze the risk of three standard estimators. For the so-called regression estimator, we show that while ...
On the Minimax Optimality of Block Thresholded Wavelets Estimators for ?-Mixing Process
We propose a wavelet-based regression function estimator for a sequence of ?-mixing random variables with a common one-dimensional probability density function. Some asymptotic properties of the proposed estimator based on block thresholding are investigated. It is found that the estimators achieve optimal minimax convergence rates over large class...
Regularized Policy Iteration with Nonparametric Function Spaces
We study two regularization-based approximate policy iteration algorithms, namely REG-LSPI and REG-BRM, to solve reinforcement learning and planning problems in discounted Markov Decision Processes with large state and finite action spaces. The core of these algorithms consists of the regularized extensions of Least-Squares Temporal Difference (LSTD) learning and Bellman Residual Minimization (BRM),...
Tracking of signal and its derivatives in Gaussian white noise
For the observation model "signal + white Gaussian noise", an on-line tracking algorithm for a signal and its derivatives is proposed. The algorithm applies to a class of signals with derivatives up to the k-th order. Asymptotic optimality in the minimax sense, with respect to small noise intensity, is established.
Empirical Results on Convergence and Exploration in Approximate Policy Iteration
In this paper, we empirically investigate the convergence properties of policy iteration applied to the optimal control of systems with continuous state and action spaces. We demonstrate that policy iteration requires fewer iterations than value iteration to converge, but more function evaluations to generate cost-to-go approximations in the policy evaluation step. Two different alter...
Journal: CoRR
Volume: abs/1409.3653
Issue: -
Pages: -
Published: 2014